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Abstract 

Background: RNA-seq, a next-generation sequencing based method for transcriptome analysis, is rapidly emerging 
as the method of choice for comprehensive transcript abundance estimation. The accuracy of RNA-seq can be 
highly impacted by the purity of samples. A prominent, outstanding problem in RNA-seq is how to estimate 
transcript abundances in heterogeneous tissues, where a sample is composed of more than one cell type and the 
inhomogeneity can substantially confound the transcript abundance estimation of each individual cell type. 
Although experimental methods have been proposed to dissect multiple distinct cell types, computationally 
"deconvoluting" heterogeneous tissues provides an attractive alternative, since it keeps the tissue sample as well as 
the subsequent molecular content yield intact. 

Results: Here we propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples 
(TEMT), to estimate the transcript abundances of each cell type of interest from RNA-seq data of heterogeneous 
tissue samples. TEMT incorporates positional and sequence-specific biases, and its online EM algorithm only 
requires a runtime proportional to the data size and a small constant memory. We test the proposed method on 
both simulation data and recently released ENCODE data, and show that TEMT significantly outperforms current 
state-of-the-art methods that do not take tissue heterogeneity into account. Currently, TEMT only resolves the 
tissue heterogeneity resulting from two cell types, but it can be extended to handle tissue heterogeneity resulting 
from multi cell types. TEMT is written in python, and is freely available at https://github.com/uci-cbclATEMT 

Conclusions: The probabilistic model-based approach proposed here provides a new method for analyzing RNA- 
seq data from heterogeneous tissue samples. By applying the method to both simulation data and ENCODE data, 
we show that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript 
abundance estimation. 



Background 

The rapidly advancing next-generation sequencing based 
transcriptome analysis tool, RNA-seq, provides a com- 
prehensive and accurate method for analyzing the entire 
RNA components of the transcriptome [1]. The effi- 
ciency and sensitivity of RNA-seq make it a primary 
method for detecting alternatively-spliced forms and 
estimating their abundances [2,3]. However, estimating 
transcript abundances in heterogeneous tissues by RNA- 
seq remains an unsolved, outstanding problem because 
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of the confounding effect from different cell types [4]. 
Many tissue samples from native environments are het- 
erogeneous. For example, tumor samples are usually 
composed of tumor cells and surrounding normal cells 
[5]. Therefore, reads from an RNA-seq experiment of 
tumor samples will consist of contributions from both 
tumor and normal cells. Additionally, tumor tissues 
themselves are often heterogeneous, consisting of differ- 
ent subclones (e.g. breast cancer subtypes [6]), leading 
to even more complicated tissue environments. 

Experimental methods have been proposed to address 
issues arising from contamination of different cell types, 
such as laser-capture microdissection [7], which allows 
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dissection of morphologically distinguishable cell types. 
The mRNA content yield by this technology is conse- 
quently lowered, and needs to be compensated for, 
usually by molecular amplification. However, the nonli- 
nearity induced by amplifying mRNA [8] has its own 
problems, and can make the expression profiles of dis- 
tinct cell types less distinguishable, weakening the sensi- 
tivity of RNA-seq technology. Other experimental 
approaches, including cell purification and enrichment, 
are comparatively expensive and laborious [9]. Therefore 
developing alternative in silico approaches to resolving 
the tissue heterogeneity problem, especially in cancer 
research, remains a major problem in RNA-seq analysis 
[10]. 

Research in computational approaches to resolving the 
tissue heterogeneity problem of different biotechnologies 
has a fairly long history [11-14]. The first attempt to 
computationally micro-dissect heterogeneous tissues for 
microarray expression data was based on a linear model 
[11], which estimated both cell-type proportion and gene 
expression level. Prior information regarding "marker 
genes", which are genes uniquely expressed in each cell- 
type, was incorporated into the linear model to identify 
distinct cell types. The linear model was extended with 
Bayesian prior densities of cell-type proportions [13], and 
a posterior sampling approach was then constructed for 
cell-type-specific expression profiling. A statistical testing 
method [14] was proposed for single nucleotide poly- 
morphism (SNP) array based copy number alterations 
analysis from heterogeneous tissue samples. In this 
method, Bayesian differentiation between hemizygous 
deletion and homozygous deletion were used to infer the 
underlying normal cell proportion and copy number pro- 
files of both normal cells and tumor cells. One common 
feature shared by these methods is that they all adopted 
probabilistic models, not only allowing prior information 
about different cell types to be smoothly incorporated 
into the models, but also taking advantages of the flex- 
ibility of probabilistic model to capture specific aspects of 
each data type. 

To the best of our knowledge, no computational 
approaches have been proposed to resolve the tissue het- 
erogeneity problem from RNA-seq data in a probabilistic 
fashion. Typically, researchers apply transcriptional pro- 
filing tools designed for homogeneous tissue samples 
directly to RNA-seq data from heterogeneous tissue sam- 
ples. Subsequent estimation results are interpreted as 
transcriptional profiling of a particular single cell type of 
interest. Therefore, we ask whether it is possible to esti- 
mate trancriptive abundances of individual cell types 
from RNA-seq of heterogeneous tissues, by decoupling 
the contributions from multiple cell types. We propose a 
probabilistic model-based approach, Transcript Estima- 
tion from Mixed Tissue samples (TEMT) to address this 



question. Currently, TEMT requires two sets of single- 
end RNA-seq reads. One read set is from a heteroge- 
neous tissue sample composed of two cell types, while 
the other is from a pure tissue sample composed of one 
of the two cell types. TEMT incorporates prior informa- 
tion of cell type proportion and can calculate probabil- 
ities of RNA-seq reads sampled from each cell type. 
Because TEMT implements an online EM algorithm 
[15], it has a time requirement proportional to the data 
size and a constant memory requirement. To further 
improve the estimation accuracy, TEMT also implements 
a bias module, which incorporates both positional bias 
[16-18] and sequence-specific bias [19,20]. 

To assess the performance of TEMT, we analyzed a 
series of both simulation and real data from ENCODE 
[21], and compared the transcript relative abundances 
estimation from TEMT to those obtained from other 
methods that do not take the tissue heterogeneity into 
account. Our results show that explicitly accounting for 
tissue heterogeneity can significantly improve transcript 
abundance estimation accuracy. 

Methods 

In this section, we first introduce the generative mixture 
model of TEMT. Combined with cell type proportion as 
prior information, we propose a maximum a posteriori 
estimation approach for finding model parameters. 
Next, we explain how to incorporate a positional and 
sequence-specific bias module into the model. Finally, 
we introduce an online EM algorithm for parameter 
estimation, reducing the time complexity to be propor- 
tional to the data size and the space complexity to be 
constant. 

Model 

Basic definition 

We focus on transcript abundance estimation. Denote T 
as a set of reference transcripts, which we assume is 
known and complete. Let l t denote the length of transcript 
t in the set with t = 1, • • • , T, where T is the total number 
of transcripts in the reference set. Suppose we are inter- 
ested in transcriptome analysis in two cell types: a and b. 
Let pf and p\ denote the relative transcript abundance of 
transcript t in cell type a and b, respectively, with 
t = 1, ■ ■ ■ , T. We assume {pf }J =1 are {pfjjLj properly nor- 
malized such that J^ti Pt = 1 and J2t=i Pt = L 

We assume RNA-seq reads are available in two samples: 
one consisting of cells of only type a, which we call the 
"pure sample", and the other consisting of cells of both 
type a and b with percentage T" from cell type a and "J~ b 
from cell type b, which we call the "mixed sample." In the 
cancer transcriptome analysis, cell type a can represent 
normal cells as it is usually easy to obtain a pure tissue 
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sample, while cell type b can represent tumor cells as most 
tumor tissue samples are contaminated by normal cells. 

Because the pure sample consists of only cell type a, 
its relative transcript abundance p\ is described by 
p P = pf for all t. However, the relative abundance of 
transcript t within the mixed sample is a weighted sum 
of the transcript abundance of both cell type a and b 



P't 



1 



(1) 



Denote the read set from the pure sample by W and 
the read set from the mixed sample by ~R, m . Our goal is 
to estimate the relative abundance of each transcript in 
the reference set T from the RNA-seq read data TZ? and 
lZ m in both cell type a and b 

Alignment representation 

We first map reads to the reference transcript set X 
and convert the raw read data into a correspond- 
ing alignment representation. Denote the align- 
ment representation of the read set W by yP = 
= 1, • • • , N p , t = I, ■ ■ ■ , T}, where y? t = 1 if read i 
from TZP aligns to transcript t and 0 otherwise, and AT is 
the total number of reads in read set TZP. The alignment 
representation y m = [y£\i = 1, • • • , N m , t = 1, ■■■ , T} is 
similarly defined for read set TZ m from the mixed sample. 
Note that one read might map to multiple transcripts due 
to alternative splicing, sequence similarity shared by homo- 
logous genes, or other reasons. As a result, the summation 
of y? ( over all transcripts may be bigger than 1 for some i. 
These "ambiguous reads" introduce a major source of 
uncertainty into transcript abundance estimation. 



X' 



4>{x;p,,a 2 ) 

7t Ei<=i </>(*'; m^ 2 ) 



{l t -x + 1) 



(3) 



We assume the fragment length x has a normal distri- 
bution with mean n and variance o 2 , and cj)(x; fi, a 1 ) is 
the normal probability density function of. By renorma- 
lizing <p(x; [i, a 2 ), we obtain the discrete distribution of 
all possible fragment lengths. The effective length ~\ t is 
then the expectation of the number of positions a read 
can start within transcript t, based on the discrete distri- 
bution of fragment length. 

Suppose a read is generated uniformly from each loca- 
tion covered by the effective length of each transcript. 
Then the probability of observing read i as represented 
by its alignment map is 



Wu£i)-E&r 



(4) 



for 5 = p or m. 

Assume each read is generated independently in both 
the pure and the mixed samples. The likelihood of 
observing the read set ^from the pure sample and TZ m 
from the mixed sample is then described by 

NP T p N m T m 

F(TZP, TZ m \{a p t }J =1 , K}L) = UT,itrUT,^r i5) 

i=1 1=1 'l i=1 (=1 ' f 



We are interested in estimating the relative transcript 
abundances set {pf}J =1 , {pfjjLj tmt since it can be 
uniquely defined by the read sampling probability set 



Generative model 

We model the sequencing of reads as a sampling pro- 
cess, randomly chooses a transcript t from the reference 
transcript set T according to its relative abundance and 
effective length, and then generates a read from a ran- 
dom location of the chosen transcript. Under this 
model, the probability of a read originating from tran- 
script t is 



(2) 



with s being either p for the pure sample or m for the 
mixed sample. Here, \ is the effective length of tran- 
script t, which quantifies the number of positions at 
which a read can start within transcript t. Different 
methods have been proposed to model the effective 
length [19,22]. In TEMT, the effective length is modeled 
with consideration to the length distribution of RNA- 
seq fragments [19] 



Pt 



(6) 



We can directly estimate the read sampling probability 
{af}J =1 , {a h }J =1 set from the likelihood function Equation (5) 
instead. Note that, again a p t = a? for all t as it is the para- 
meter of pure sample, but unlike the linear form in Equa- 
tion (1), a™ in terms of af, a b is given as a nonlinear form 



K a r a a a t + A b r b a b 



A" = 



J2k=i Pp± A b = 



EL 



Efe=i 



(7) 



(8) 



Where, the factor A", A b induce the nonlinearity. But 
due to the averaging effect of the large number of tran- 
scripts, practically A", A b lies within 1 ± 0:05. So we 
approximate af with the linear form 



r a af + t V 



(9) 
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As it brings computational convenience in the follow- 
ing learning step. 
Finally, we define 

© = {{«?&!, ( 10 ) 

as the parameters of our model. The likelihood in 
Equation (5) can be then expressed as 

NP T Q,a N m T ( r a a a r h ot°\ 

pyv, ft m ie) = n Erf t^nErT r < n > 

i=l t=l l t i=l t=l (t 
Maximum a posteriori estimation 

Several analysis have noticed the identifiability problem 
[12,13] in estimating cell type specific expression in het- 
erogeneous tissue samples. Ideally, if the proportion 
information for some cell types is missing, we can then 
pool these cell types as one type, making the expression 
of each individual cell type inside unidentifiable. Pre- 
viously, prior constraints have been used to resolve the 
problem [12,13]. In our model, the prior knowledge of 
cell type proportions is combined with the model likeli- 
hood, and we subsequently use maximum a posteriori 
(MAP) estimation to find the optimal parameters. 

Specifically, we place a Beta [fi a , f} b ) distribution as 
the prior for cell proportions of type a and type b. The 
parameter p a l p b quantify the location and sharpness of 
the prior. Practically, we found setting f}", fi b 10 times 
as the data size gave a good convergence rate and accu- 
racy. Combining the prior with the likelihood given in 
Equation (11), the posterior distribution of the model is 
proportional to 

riE^r nE>-;, "T (ry-w- 1 (12) 

i-1 »-l / L i-1 i-1 '' J 



V { t G [0, l t ] as the starting position of the alignment 
within transcript t relative to the 5' end of the strand. 
We also denote n[ ( e E L , where T. = {A, C, G, T], as 
the local sequence of transcript t with length L and cen- 
tered at V it t .Then we define the bias weight u>? t as 

P(^ t |bias)P(^ t |bias) 
u P(^ t |uniform)P(^ £ |uniform) ( ' 

for s=p or m. 

This bias weight t is essentially the ratio of the 
probability of observing b\ t and 7t- t under the bias 
model to the probability under the uniform model. If no 
bias exists, the weight m^j reduces to 1. The bias re- 
weighted Equation (4) is then: 

myi t )Li) = Y,ti.tT<t (14) 

To calculate the bias weight, we use the bin method 
and Markov chain for positional bias and sequence-spe- 
cific bias respectively. Complete details can be found in 
the Supplementary (Additional file 1). The final unnor- 
malized posterior distribution of the model is then 
described as 

WE/uj<,J I nEa '{ ' X . I (r-r-Vf-' (15) 

Where tJ[ and u>™ t are the bias weights computed 
based on read set HP and lZ m . The directed graphical 
model of TEMT is shown in Figure 1. The estimated 
parameters are given by 

0 = argmax\ogV{@\TZ p , K m ) (ig) 

e 



Incorporating sequencing bias 

Both positional [16-18] and sequence-specific [19,20] 
sequencing biases have been observed in next generation 
sequencing data. These biases mainly result from non- 
uniformly distributed cDNA fragments during the RNA- 
seq library preparation [20]. Under positional bias, reads 
positioning is not uniformly distributed across the effec- 
tive length of the target transcript, but preferentially dis- 
tributed around either the 5' end or the 3' end of the 
target transcript. Under sequence-specific bias, the 
sequences near the two ends of the fragments affect 
their probability to be sequenced. To account for these 
non-uniformity effects during transcript abundance esti- 
mation, we incorporate the bias module of [19] into our 
model. 

In order to further describe the local alignment con- 
text, we define another two sets of variables. Specifically, 
for read i from either read set W or TZ m , we denote 



Online EM algorithm for learning 

We solve the maximum a posteriori problem in Equation 
(16) using the Expectation-Maximization (EM) [23] frame- 
work. For each read i from read set TZP of pure sample, we 
denote the latent variable of the transcript alignment 
representation as Z- = {z p it \t =1, • • • , T}, where = lif 
read i aligns to transcript t and 0 otherwise. But now 

T 

E^ft = 1> which means only one z P it = 1, indicating read i 

is actually originating from transcript t. Similarly, for each 
read i from read set TZ m of mixed sample, we denote the 
latent variable of the transcript alignment representation 
as Z, m = [z™, z™j>\t = 1, • • • , T), where z™ = 1 if read i 
aligns to transcript t and is originating from cell type a 
within the mixed sample, and 0 otherwise. z™ b = 1 or 0 is 
similar defined for cell type b. Thus EL (<T + = 1 
means read i is actually originating from only one 
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transcript, and either from cell type a or b within the 
mixed sample. We also define the auxiliary variable 

<? = p ( z r? = i \ & ' y p ' y m )> c = r Ki = y p - y m ) 

and qf t = P{zf t = 1|0, yP, y m ) as the conditional prob- 
ability weight of each latent variable £ = 1, z£f = 1 and 
z"* = 1 conditional on model parameters 0 and the 
observed read alignment representations yfy m . Then 
based on Jensen's inequality [24], the complete posterior 
distribution, which is also the lower bound of Equation (15) 
can be written as 



p(0ik», k-)>- nn< 



nn^-s)*^-*) Wr-'fY-' (17) 



In which C is a normalizing constant and the equality 
holds only if the conditional probabilities cf { t , cf$, 
are the true posterior distributions of latent variables 

The EM framework maximizes Equation (17) by itera- 
tively applying the expectation step and the maximiza- 
tion step to update both the conditional probabilities 
cf££, cl™ b t and model parameters 0 until convergence. 
The expectation step of typical batch EM algorithm has 
to fetch all the data points into memory, and calculates 
the conditional probabilities based on the average of all 
the data points. While this batch method guarantee's 
the log-likelihood function to monotonically increase, it 
also induces inefficiency in both time and space 



complexity. Considering the high-throughput nature of 
next-generation sequencing technology as well as its 
huge data size, we implemented the EM algorithm in an 
online fashion [15] to both lower the memory require- 
ment and boost the convergence rate. 

The main difference between the batch EM and the 
online EM is in the E-step. The E-step of the online EM 
algorithm first calculates the conditional probabilities of 
only one new data point, and then updates the condi- 
tional probabilities of all the current data points by inter- 
polating between the conditional probabilities of all the 
previous data points and the conditional probabilities of 
the new data point, with a forgetting factor a controlling 
the convergence rate. 

It is shown in [15] that with the constraint 0.5 < <7 < 1, 
the online EM algorithm is asymptotically equivalent to 
stochastic gradient ascent, and is guaranteed to converge 
to the maximum likelihood estimator, which is extended 
to the maximum a posteriori estimator in our model. 

Specifically, the online EM updates in our model is 
given by 

E-step 



p n*i,i i w i,t 

k ~, 



(18) 
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In Equation (18-20), we compute the conditional prob- 
abilities cf i+1 t , (fu\,p t °f ) us *- one new rea d i + 1 
based on previous parameter estimation [<Xf^}J =1 > 
r a{n) i x b{n\ x a(n) _ r b(n) ; j n Equation (21-23), we compute 
the new conditional probabilities average q^f- n+1 \ 
^mb(n+i) interpolating between the previous conditional 
probabilities average and cf i+ht , Ci,t> 

d™j t . n is the index of iteration step and i is the index of 
data points, a is the forgetting factor which controls the 
convergence rate, with the constraint 0.5 < a < 1. 



M-step 



t=iH*,t + 



N m 



1 + 



(24) 



T k(n+1) 



ET rf mb(n+l) , 
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a(n+l) _ "*,t + "*,t 
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(25) 



(26) 



(27) 



In the subsequent M-step, parameters {af'" +1 ^}f =1 > 
T "( n +i), T fl («+i), t b(n+i) are updated according to new con- 
ditional probabilities average 4™ (n+1) , <£* t ( " +1) . 



Results 

Next we test the performance of the proposed method 
on both simulation data and the recently released 
ENCODE data [21]. For both datasets, we used the fol- 
lowing three-step protocol and parameters to construct 
the analysis: 

1. We aligned the raw read set from either simulation 
or the ENCODE data to a given transcript set using 
bowtie-0.12.7 [25]. For each read, we allowed 2 mis- 
matches and reported at most 10 candidate alignments. 

2. The abundance of each transcript in terms of esti- 
mated counts was estimated via both TEMT and a con- 
trol model. Estimated counts is defined as the estimated 
number of reads generated from the target transcript. In 
TEMT, the prior of each cell type proportion was set to 
the same as the proportion used in simulation and 
ENCODE data respectively, and fi a , j3 b was set to 10 
times the size of the read set lZ m . ft = 200; a = 80 were 
used as the mean and standard deviation of the RNA-seq 
fragment length distribution. We chose eXpress-0.9.4 
[26] as the control model, as it is the state-of-the-art 
method for transcript abundance estimation and also uti- 
lizes an online-EM algorithm. Note that, to run TEMT, 
we need two read sets, in which one is for the pure sam- 
ple and the other is for the mixed sample, as previously 
mentioned. In contrast, to run eXpress, we only need one 
read set from either the pure sample or the mixed sam- 
ple. The forgetting factor for the on-line EM algorithms 
in both TEMT and eXpress was set to be a = 0:85, and 
the error-model in eXpress was disabled for comparison. 

3. To measure the model accuracy, we used the Error 
Fraction (EF) measure introduced by [17] to quantify 
the discrepancy between the model estimates and the 
ground truth estimates. The Error Fraction is defined as 
the fraction of transcripts for which the estimates are 
significantly different (percent error >10% in our case) 
from the ground truth. 

Simulation 
Data preparation 

To show the utility of TEMT, we first carried out a ser- 
ies of simulation studies. To obtain simulated read sets, 
we used FluxSimulator [27], a software for transcrip- 
tome and read generation by simulating the biochemical 
processes underlying the library preparation. FluxSimu- 
lator requires a reference transcript set to start the 
simulation process, so we manually downloaded 406 
transcripts of 208 alternatively spliced genes in human 
from Alternative Splicing Structural Genomics Project 
(AS3D) [28], and used these 406 transcripts as the refer- 
ence transcript set. We first simulated the transcript 
expression process twice producing two sets of relative 
transcript abundances, corresponding to cell type a and 
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b respectively. Based on these two transcript abundance 
sets, we then simulated 6 pairs of 1 million 75-bp sin- 
gle-end read sets corresponding to six different cell type 
b proportions from 40% up to 90%. The relative tran- 
script abundances of cell type a and b were kept the 
same throughout these simulations. For each paired 
read set, one read set is for the pure sample composed 
of only cell type a, whereas the other read set is for the 
mixed sample composed of both cell type a and b, 
mixed with the cell type b proportion. Within the 
mixed-sample read set, we also extracted the reads 
simulated purely from cell type b, which was used for 
control model eXpress. 
Analysis 

The simulated data are analyzed with the bias module 
both enabled and disabled. Surprisingly, the positional 
and sequence-specific bias module did not improve the 
accuracy of the transcript abundance estimation as mea- 
sured by the Error Fraction of estimated counts in both 
TEMT and eXpress. This result may due to the stochas- 
ticity during the simulation of FluxSimulator. So we 



only present the results with the bias module disabled 
in both TEMP and eXpress in Figure 2. 

We note that the estimates of cell type a from TEMT 
achieve roughly the same accuracy, compared with the 
estimates from eXpress based on the read set of the 
pure sample of cell type a. Also, this accuracy does not 
change significantly under the effect of different cell 
type b proportions. This is mainly due to the pure sam- 
ple read set of cell type a within the input data for 
TEMT. 

The accuracy of the estimates of cell type b from 
TEMT is also shown in Figure 2, which shows that 
TEMT generally outperforms the direct estimation 
method. To the best of our knowledge, there are no com- 
putational tools similar to our model that can estimate 
the relative transcript abundances of cell type b via RNA- 
seq data generated from mixed samples. Typically, com- 
putational methods are applied directly to the noisy data 
of mixed samples and results are interpreted as the esti- 
mates of cell type b. To compare the estimates of cell 
type b from TEMT with direct estimates using the 
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Figure 2 Analysis results of simulated data of 6 different cell type b proportions with the bias module disabled. The x-axis is the 
different cell type b proportion, and the y-axis is the Error Fraction of the corresponding estimates. The green and blue lines are the estimates 
from TEMT for cell type a and cell type fa, based on the two read sets of the cell type a pure sample and the mixed sample. The yellow and 
magenta lines are the estimates from eXpress for cell type 0 and cell type fa, based on the two read sets of the cell type a pure sample and the 
cell type fa pure sample. The red line is the direct estimates from eXpress for cell type fa, based on the read set of the mixed sample. 
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current method, we applied the control model eXpress 
directly to the read set of the mixed sample. The esti- 
mated counts from eXpress were then compared with 
the true counts from another 1 million simulated read 
set purely of cell type b, while keeping the same relative 
transcript abundance as the previous simulations. The 
corresponding Error Fractions are shown as the red line 
in Figure 2 regarding different cell type b proportions. 
Although the accuracy of cell type b estimates from 
TEMT is affected by different cell type b proportions, it 
is generally better than the direct estimates. This can be 
further illustrated in Figure 3, which shows that the 
direct estimated counts of cell type b from eXpress devi- 
ate more from the true counts as the cell type b propor- 
tion decrease, while the estimates of TEMT have much 
reduced deviation. We notice that as the cell type b pro- 
portion gradually decreases, the accuracy of the estimates 
of cell type b from TEMT also decreases. This is the 
result of the contamination effect from the cell type a 
within the mixed sample. A recent paper [4] also 
observed this similar phenomenon when studying copy 
number aberrations from heterogeneous tumor tissue. 



ENCODE data 
Data preparation 

Next we analyzed the recently released ENCODE data. 
Due to the lack of RNA-seq data sampled from mixed 
tissue samples with known cell type proportions, we 
artificially generated the mixed-sample read sets by mix- 
ing reads obtained from two different cell types. Specifi- 
cally, we chose two Tier 1 cell lines, GM12878 and 
K562, and treated them as cell type a and cell type b 
respectively. The corresponding single-end RNA-seq 
data of these two cell lines, GM78 lx75D A 1 (UCSC 
Accession: wgEncodeEH000125) and K562 lx75D A 1 
(UCSC Accession: wgEncodeEH000126) from the Wold 
lab [29] at Caltech, were download from ENCODE 
(2012). The data downloaded from the same lab under 
similar protocols is intended to reduce the deviation 
resulting from experiments. We then randomly selected 
10 million reads from GM12878 cells to form the read 
set of the pure sample, and 10 million reads from both 
GM12878 and K562 cells using different K562 cells pro- 
portions to form the read set of the mixed sample. Simi- 
lar to the previous simulation study, we extracted the 
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reads purely selected from K562 cells within the mixed 
sample, and used them for the eXpress control model. 
We studied 6 different K562 cells proportions from 40% 
to 90% in order to compare with the previous simula- 
tion study. 36908 human RefSeq [30] transcripts from 
UCSC known genes [31] were used as the transcript set 
for the ENCODE data. 
Analysis 

One major issue in studying the ENCODE data is that 
the ground truth of relative transcript abundance in 
each cell type is unknown. We used the estimates from 
eXpress based on the GM12878 and K562 pure samples 
as the ground truth. Again, the bias module was dis- 
abled for both TEMT and eXpress. The general result of 
ENCODE data is shown in Figure 4. Similar to the 
simulated data, the indirect estimates for K562 cells 
from TEMT generally outperforms the direct estimates 
from eXpress based on the read set of the mixed sam- 
ple. The contamination effect from cell type a within 
the mixed sample observed in Figure 3 is also seen in 
the eXpress analysis of ENCODE data, while TEMT 
does not have this issue. Note that the measure of 



relative transcript abundances as shown in the red line 
of Figure 4 is no longer estimated counts, but reads per 
kilobase of transcript per million mapped reads 
(RPKM), as the total number of reads from K562 cells 
within the mixed sample is less than the total number 
of reads of the mixed sample, so that normalization is 
necessary for comparison. We notice TEMT underper- 
forms direct estimates from eXpress when K562 cells 
proportion equals 90%. Possibly the contamination effect 
of GM12878 cells within the mixed sample is not severe 
enough at this point, as we can imagine the red line in 
Figure 4 will finally reach 0% Error Fraction when K562 
cells proportion reaches 100%. On the other hand, since 
the estimates from eXpress based on the pure sample 
are considered the ground truth, the lower bound Error 
Fraction of K562 cells estimates from TEMT should be 
the same as the Error Fraction of GM12878 cells esti- 
mates, which is around 20% to 30% in Figure 4. 

Discussion 

We formulated our model under the assumption that 
the heterogeneous tissue is only composed of two cell 
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types, but in reality, a heterogeneous tissue might be 
much more complicated, consisting of multiple cell 
types. To relax this constraint, our model needs to be 
further extended to analyze more complex cases in 
which each cell type may have its own subtypes, e.g. 
breast cancer subtypes, leading to a more sophisticated 
heterogeneous tissue environment. Further dissecting 
cell subtype heterogeneity is the next step in refining 
our model. Moving from two cell types to arbitrarily 
many cell types is of great interest, since it may substan- 
tially facilitate transcriptome study of heterogeneous 
tissues. 

One critical component necessary to make our model 
work is the prior information of cell type b proportion, 
which is necessary to resolve the identifiability problem of 
mixed samples. In real experiments, precise prior informa- 
tion regarding cell type proportions may be unavailable. 
One solution in the context of our model is to down 
weight the effect of the prior by decreasing the parameter 
P", j3 b , which adds more uncertainty to the cell mixture 
proportion. However, this approach may decrease the per- 
formance of the model as the uncertainty in cell mixture 
proportion cannot be distinguished from the uncertainty 
in transcript abundance estimation. This observation sug- 
gests another direction to further improving our model 
which is to solely estimate cell type b proportion without 
the prior information. To fulfill this requirement, the iden- 
tifiability problem needs to be resolved as mentioned in 
section 2.3, which turns out to be comparatively hard for 
RNA-seq data. Unlike the heterozygous and homozygous 
deletions in [14], which can be utilized to differentiate 
between the SNP array data generated by normal cells and 
tumor cells, there are no such explicit differences between 
the reads generated by distinct cell types in RNA-seq data, 
thus making the generative mixture model unconstrained. 
The "marker genes" method proposed by [11], which tries 
to distinguish distinct cell types by utilizing genes uniquely 
expressed in each cell type, provides a future potential 
direction to extend the current model. 

Conclusion 

In this article, we propose a probabilistic model-based 
method TEMT to estimate transcript abundance of indivi- 
dual cell types based on RNA-seq data from heteroge- 
neous tissue samples. TEMT utilizes prior information to 
distinguish reads generated by each cell type within the 
heterogeneous tissue sample. Positional and sequence-spe- 
cific biases are also incorporated to improve estimation 
accuracy. TEMT is able to process large datasets as the 
online EM algorithm is adopted to guarantee a time com- 
plexity proportional to the data size and a constant space 
complexity. Our experiments on both simulated datasets 
and ENCODE data shows that explicitly accounting for 



tissue heterogeneity can significantly improve the accuracy 
of transcript abundance estimation. 

Additional material 



Additional file 1: Supplementary. Complete details for calculating 
positional and sequence-specific bias weights. 
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